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CN \ Abstract 

We present an alternative method for calculating likelihoods in molecular phylogenetics. Our 
method is based on partial likelihood tensors, which are generalizations of partial likelihood vectors, 
as used in Felsenstein's approach. Exploiting a lexicographic sorting and partial likelihood tensors, 
it is possible to obtain significant computational savings. We show this on a range of simulated 
data by enumerating all numerical calculations that are required by our method and the standard 



approach. 
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1 Introduction 



In his landmark paper (jFelsensteinl . 119811 ). Felsenstein popularised the method of maximum like- 
lihood for phylogcnetic estimation. Crucial to practical implementation was the introduction of a 
"pruning algorithm" , which, under the assumptions of a reversible Markov process on a tree, al- 
lowed for efficient computation of the likelihood of observed molecular sequences. Since this time, 
there has been an explosion in the use of the maximum likelihood method in phylogcnetic studies. 
There have also been numerous algorithmic developments including computation with more gen- 
eral models (Boussau fe Gouv . 2006: Yang, 1997) and heuristics speedups ac ting at the likelihood - 
step (IGuindon fc Gascuell . l2003t IStamatakisl . 120061 ) or the tree search level (jGuindon fc Gascuel . 
2003t IWhelanl . 12007 ). However, at the likelihood step, the basic algorithmic implementation still 
proceeds by applying Felsenstein's original recursive formula. 

We present an alternative method for computing likelihoods given a tree, a root distribution 
and a set of transition matrices. The particulars of the transition matrices do not concern us, as our 
results are independent of the model of sequence evolution. Hence, we will assume these matrices 
are given, and concentrate on computational complexity at the likelihood step. This is well 
justified, as, aside fro m the tree search, the likelihood step is the most intensive part of maximum 
likelihood estimation (jBrvant et all I2005T ) . Given the task of calculating of the likelihood of a 



single site in a sequence alignment, the method we present can actually be more cost intensive 
than applying Felsenstein's recursive formula. The effectiveness of our method becomes apparent 
by considering that in practice one is never calculating the likelihood of just a single site, but must 
calculate the likelihood of each and every site. We will show that our method offers significant 
computational savings; a speedup of up to a factor of 6 for the most favourable of realistic cases. 

As noted by Felsenstein, the most obvious cost-saving measure follows by observing that many 
site patterns in an alignment occur multiple times and there is no need to recalculate the likelihood 
each time. The process of identifying these common sites is called "aliasing" . Aliasing aside, the 
basic premise of our method is that a large number of sites in an alignment are often very similar 
to each other (two sites share most states in common and differ on only 1 or 2 of the sequences). 
This is certainly true for realistic data sets, as they have evolved from a common ancestor in the 
not-too-distant past (at least by hypothesis if not in fact). We define partial likelihood tensors 
(PLTs) as multi-dimensional arrays that generalize Felsenstein's partial likelihood vectors (PLVs). 
Using these PLTs, it becomes advantageous to sort the sites lexicographically and retain a few 
of these PLTs as the likelihood of each site is calculated. These PLTs can then be returned to 
when it comes to computing the likelihood of the next site, resulting in a minimization of the total 
number of calculations required. 

An important aspect of our approach is that the lex icographic ordering can be computed 



exactly and efficiently using an 0(N(m + fc)) radix sort (jCormen et all 120011) . where N is the 



number of unique site patterns in the alignment, m is the number of sequences and fc is the 
number of possible character sta tes. This should be compared to the related "column sorting" 
approach of IPond fc Musd ( 20041 ). where, as the calculation moves through the alignment, each 
PLV from the present site is retained. Depending on how the next site differs from the present 
site, some of the PLVs required for the next site will be identical to those retained, and hence 
some superfluous computation can be avoided. This approach relies upon a (heuristic) solution of 
a travelling salesman proble m (TSP) to fin d an ordering of sites that maximizes the saving. The 
solution of the TSP used bv IPond fc Musd is 0(N 2 ), so we can expect that their technique works 
best for shorter sequences. This is the direct converse of the approach we present here, which is 
at its greatest power, relative to other approaches, when the sequence length is long, such that 
the data is very heterogeneous. 



Another related approach is implemented bv lStamatakis et all ([20051 ) using "Subtree Equality 
Vectors" (SEVs) . This approach extends the idea of aliasing to the subtree level: faced with the 
need to calculate the likelihood on a given subtree, a sweep through the corresponding sequences in 
the alignment is performed, counting occurrences of homogeneous sub-patterns. (A homogeneous 
pattern is one in which each character state is identical.) Only the homogeneous patterns are 
accounted for as a general count would not amortize well with large data sets, and the SEVs must 
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Figure 1: Septet tree 



be recomputed for each alternative tree. iStamatakis et al\ are primarily interested in the problem 
of likelihood computations for data sets with very many taxa (10 3 and above). 



2 Methods 

Here we present our method, retroML, for computing the likelihood of molecular sequence data 
under the assumption of a Markov model on a rooted binary tree. We do this with the aid of an 
example for a septet tree (Figure [T]). In Appendix lAl we give an example for a tree with 16 leaves 
and in Appendix [B] we give a generic presentation, valid for trees of any size. We will also discuss 
the computational complexity of retroML as judged against Felsenstein's approach T . 

We consider an alignment of m sequences, with no gaps. (It is straightforward to modify 
our results for when there are gaps, depending on how they are to be dealt with.) A "pattern" 
will be the (ordered) sequence of states that occur at a given site in a sequence alignment. The 
actual numeric values of the Markov model parameters will be of no concern to us: our results 
depend only on the number m of leaves of the tree, its topology, the number k of states, and the 
number N of unique patterns in the alignment. Thus, we take the transition matrices defining 
the Markov process on the tree as given (with one matrix for every vertex excluding the root) 
and consider the complexity of computing partial likelihood vectors at the root, conditioned on 
the patterns observed at the leaves. Rather than present empirical timing results, which are 
dependent on computer hardware and/or programming language, we will compare an exact count 
of numerical operations required of retroML to that of T. For this purpose, we define a "cost" 
of a computation as a pair representing the numbers of multiplications and additions required, 
respectively: s(-) = [s*(-), s + (-)] for retroML and /(*) = [/*(')> / + (0] f° r the standard approach T. 



2.1 Partial likelihood tensors 

In Felsenstein's method J 7 , a partial likelihood vector (PLV) at a vertex represents the likelihood 
of observing each of the k possible states at that vertex, conditional upon a pattern of states at 
the leaves of the subtree subtended by that vertex. For the i th pattern, these PLVs are computed 
recursively by implementing the formula 
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where v is an internal vertex with children u\ and U2, and M a ^ is the probability of a transition 
from state a to b along the edge connecting v and Uj . For a PLV at a leaf, the entry that corresponds 
to the state present at that leaf is set to 1, whilst the other entries are set to 0. This recursive 
formula plays a vital role in the efficient implementation of all likelihood calculations in molecular 
phylogenetics, and it is exactly this recursion that we aim to supplant with our approach. 

The basic components underlying retroML are partial likelihood tensors (PLTs), which can 
be thought of as generalizations of partial likelihood vectors. A PLT represents the likelihood of 
observing arbitrary states on multiple vertices of a tree, conditioned upon observing certain states 



at some of the leaves. Notationally, for a subset of vertices labelled as (vi, . . . , v r ) with state a^ 
at vertex Vi, and a subset of m < to leaves labelled as (£1, . . . ,£m) with state Xi observed at leaf 
£i, we express the corresponding PLT as 

^:; a -^ ) (x 1 ...x A ). 

The "rank" of a PLT is here defined as the number of vertices r, and, given that the states a% are 
free to range over any of k values, we see that a PLT encodes k r numbers. Note that a PLV is a 
rank 1 PLT, but the converse is not true in general: a rank 1 PLT is not necessarily a PLV. 

To judge performance, we will compare the cost of computing, at the root of a tree, the PLVs 
of the observed patterns in an alignment using T and retroML. Equipped with these PLVs and a 
root distribution 7r(a), the likelihood of each site is calculated the same way for both methods: 



L i :=^r t \a)n{a). 



Finally, the negative log-likelihood of the alignment is calculated by 

N 

— InL := — y ailnLi, 
»=1 

where a, is the number of times the i th pattern occurs in the alignment. 

Under conditions that we will clearly delineate in N3- H it is possible to achieve significant com- 
putational speedups by employing partial likelihood tensors. This is made possible by observing 
that PLTs can be used to compute likelihoods that are conditioned upon subpatterns of arbi- 
trary length. For example, in an alignment of 5 sequences with the pattern at the first site being 
X1X2X3X4X5 it is possible that there will exist another pattern X\X 2 XsX^X^ that has the same 
first three states. If we have the PLT that is conditioned on observing the subpattern XiX 2 X%, 
then we see that we can save on computations by invoking this PLT when calculating the likeli- 
hood of X1X2X3X4X5. This cannot be achieved with PLVs alone as the particular subpatterns 
they can incorporate are constrained by the tree topology. An exception occurs when the tree 
topology is a "caterpillar" (completely unbalanced tree), for which retroML can be implemented 
using PLVs only (we discuss this case in detail below). 

We demonstrate retroML with an example calculation on the septet tree in Figure [TJ retaining 
PLTs conditioned upon subpatterns of arbitrary length. Without loss of generality, we consider Vi 2 
to be the root of this tree and begin our computation at leaf £\ . At each step we move to append 
the state at the next nearest leaf into the subpattern, as this helps to keep the rank of the PLTs 
minimal (this is an important consideration for reasons that we will discuss later). Additionally, 
we retain only the PLTs that occur just before a leaf state is appended. This ensures that when 
we return to a PLT for a subsequent pattern, the optimal saving in computation is achieved. 

retroML begins by computing the PLT conditioned on the observed state X\ at l\ and an 
arbitrary state at vertex w 8 (see Figure Ufa)): 



®^(X 1 ) = M* 1 



aX ± - 



Note that this PLT is not a PLV, as it is conditional upon only the state at one of its children; 
the corresponding PLV at v$ would be conditioned on the states at both children. This PLT can 
be used to incorporate the subpattern X\X 2 whilst moving over to v 10 , retaining arbitrary states 
at vg and summing over the states at v$ to give the rank 2 tensor (see Figure [^b)) 

*i™ \ Xl X 2 ) = \y j Mll^\X 1 )MlA AC". 

From here, extending to the subpattern X\X 2 X^ is a simple computation (see Figure HJc)): 

^:r io \xxx 2 x s ) = ^°\x x x 2 )mz* Xs . 




Vg I Vg 



"12 



I'll 



I'll) 




(a) tt< 08) (Xi) 



h. 



V V8 "g \ 



'-'12 



'-'11 



"lO\ 




(b) ^r io \xix 2 ) 




(c) ^'^(X^X 



4. 



v«8 Vg 



''12 



Wll \ 



'-'10 




(d) ^ ll) (X 1 X 2 X 3 X 4 ) 




(d) ¥ a vi2 \x 1 x 2 x 3 x 4 x 5 ) 



«1. 



."8 "9 



''11 



''10 




(e) *i" 12) (X 1 X 2 X 3 X 4 X 5 X 6 ) 



Figure 2: Partial likelihood tensors 



Here we see that the generality of the partial likelihood tensors is needed: It is not possible to 
compute a likelihood for the subpattern X1X2X3 using PLVs only; it is vital that arbitrary states 
at both vg and v\q are allowed for, and this requires a rank 2 tensor. For this tree topology, 
there is simply no other way around this: without allowing for arbitrary states at i>io, one cannot 
incorporate the state X 4 into the final likelihood and, likewise, arbitrary states must be allowed 
at Vg else the leaves £5,^ and £7 be n eglected. (These considerations are intimat ely tied to the 
separability of the tensors in question (JLandsberg fc Manivel l2008t ISumnerl . 120061 ).) 

Incorporating the subpattern X1X2X3X4, and then moving over to vi± requires summing over 
the states at vertices vg and uio (see Figure [^d)): 



*(^)(X 1 X 2 X 3 X 4 ) = EC'*i"r" I0) (^^3)M l 



,b 
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Note that this PLT is not the PLV that would occur at vn, as it is not conditioned upon X5. 
Extending to X1X2X3X4X5 and moving over to V12 requires summing over the states at v\± 



(see Figure |2Je)): 

^\x 1 x 2 x 3 x i x 5 ) = J2M2^ { : il \x 1 x 2 x 3 x i )M v j X5 . 

a' 

Incorporating the subpattern XiX2X^X^X^Xq is simple (see Figure [2{f)): 

^ 12) (x 1 x 2 x 3 x i x 5 x 6 ) = ^\x 1 x 2 x 3 x 4 x 5 )M^ Xe , 

and, finally, we compute at v\ 2 : 

^^{X^X^X^Xj) = *i Ull) (*l*2*3*4*5*6)M^ 7 . 

One can check that this is exactly the PLV that Felsenstein's method J- would calculate at v\ 2 . 

From our procedure we see that we can save on computations if the above PLTs are retained 
for subsequent patterns. For example, if the next pattern under consideration shares its first 
five states in common with a pattern that has already been dealt with, then we can return to 
^ a 11 {XiX 2 XzXiX^) and the effective size of the tree that must be traversed to calculate the 
PLV of the pattern is only 2. 

This concept will hold in general for retroML: If the pattern under consideration shares its 
first rh states in common with a pattern for which the likelihood has already been computed, the 
effective size of the tree that must be traversed is {m—rh) . Clearly, this approach has the potential 
to save on significant amounts of computation. 

As we discuss in detail below, the attractiveness of the approach is offset by the number of 
computations required in each step, which is 0(k r ), where r is the rank of the PLT involved, 
and, for worst-case tree topologies, this rank can become prohibitively large. This is exactly what 
Felsenstein's post-order recursive method J- avoids. A secondary issue for retroML is that the 
memory requirements of keeping all these PLTs in memory could easily become prohibitive for 
a large number of observed patterns. This issue is easily addressed: by tackling the observed 
patterns in a certain order, only the (to — 1) PLTs from the previous pattern need to be retained 
in memory. We next describe how this can be achieved. 

2.2 Lookbacks 

We refer to an ordering of the leaves of a tree as a "leaf order". For a tree with to leaves, there are 
of course to! possible leaf orders. The leaf order will be set by the order in which our algorithm 
visits the leaves, and thus sets the order of the states in the patterns. 

Given a list of patterns and a leaf order, a "lookback value" o~i is to minus the position of the first 
state (counting from 0) where the (i— l) th and i th patterns differ, with <j\ —m. The lookback value 
of the current pattern tells us exactly which PLT to return to: there is no additional computation 
involved. Additionally, considered as character strings, a lexicographic ordering of patterns will 
minimize the sum of the lookback values. This follows from the definition of lexicographic ordering. 
Since the effective tree size is reduced to <7j, our approach will perform very well when the pattern 
under consideration has a small lookback value. If the patterns are sorted lexicographically, we see 
that only the PLTs from the previous pattern need be retained in memory for best-performance 
to be attained. This also solves the secondary issue raised above, as retaining only the PLTs from 
the previous pattern in memory keeps the memory requirements constant. 

To get an idea of the distribution of lookback values that may occur in an alignment, we 
consider the idealised case where every possible pattern occurs. For an alignment of m sequences, 
the number of possible patterns is k m . If the patterns are sorted lexicographically, then the number 
of patterns with lookback value a — m is k and the number with lookback value < a < (to — 1) 
is k m ^(k-l). We note that 
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k m ^(k-l) = k + {k m -k m - x ) + (k m - x -k m - 2 ) + ... + (k 3 -k 2 ) + (k 2 -k) = k r ' 
i 



In the above example we took a particular traversal of the tree, allowing for the computation 
of the PLTs conditioned on subpatterns of sizes 1 to 7. The performance of retroML is highly 
dependent on the way in which the tree is traversed; at all times it is important to minimize the 
rank of the resulting PLTs. For a given tree, there is an optimal path that keeps the rank of the 
PLTs minimal and, in general, this path is not unique. For instance, in the above example, there 
would be no difference in the ranks attained if we started at any of the leaves £i, li,l%i ^4, ^6, @7'- m 
these cases the highest rank attained is 2 and this occurs only once. If, however, we started at £5 a 
rank 2 PLT would be required twice: once on the vertices vg, t>io and once on the vertices vn, U12. 
Additionally, the performance of retroML depends on the structure of the observed patterns after 
sorting; the smaller the value of the lookbacks, the better. We showed above that if all the 
possible patterns are present, then the number of patterns with a lookback of a is 0(k m ~ a ' +1 ), 
from which we see that if high rank PLTs occur in the implementation then it is better that they 
are encountered early, for large lookback values. For instance in our example, starting at any of 
the leaves ^1,^2,^3,^4 will outperform starting at l§ or £7, because in the former the rank 2 PLT 
is encountered when cr=5, whereas in the later the rank 2 PLT is encountered when <r = 4. 

In any practical implementation, the interaction between the dependence on tree topology and 
the lookback structure of patterns is complicated. We will show in §3.11 that the performance 
for different tree topologies is quite distinct; the ranks of the PLTs attained tell all. Whereas, 
the effect of the observed patterns is dominated by their number; judged only by their lookback 
values, the patterns occurring in sequence alignments are more-or-less random. 

We have found a particular tree traversal that takes into account these issues and works well in 
practice. We do not prove that this is the best possible tree traversal, but provide it in Appendix IBl 
as a useful heuristic. It is designed to ensure that the rank of the PLTs encountered is kept minimal 
by at all times ensuring the method moves to the next nearest leaf. In Appendix [B] we also supply 
a useful heuristic for finding the best starting leaf based on the above considerations. 

2.3 Costs 

In this section we detail the exact costs of each of the steps involved in T and retroML. Because 
the particular model parameters present are of no consequence to the cost of computations, in this 
section we will omit all vertex labels from transition matrices, PLVs and PLTs. We also omit the 
site label i from the PLVs. 

So that we are fair in the comparisons we make, we note here that the computation of a partial 
likelihood vector at a vertex can be simplified if one or both of its children are leaves. In the case 
that the vertex is completely internal (i.e. neither child is a leaf; see Figure [3][a)) the calculation 
proceeds as 



L lnt (a) 
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which, to compute for 1 < a < k, costs 

f{L int ) = [k(2k+l),2k(k-l)}. 

However, if exactly one of the children of the vertex is a leaf (see FigureJSJb)), then the calculation 

is 



L halfint (a) = (j2M ab L(b) j • (j2 M - S -A 



which can be simplified to 



L halfint (a) = fj2M ab L(b)) -Max, 
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Figure 3: (a) An "internal" vertex; (b) A "half internal" vertex; (c) A "cherry" vertex 

with cost 

f(L h * lli ^) = [k(k+l),k(k-l)}. 

Similarly, if both children are leaves (making a "cherry" ; see FigureEtc)), the calculation simplifies 
further: 

L^^(a)=M aXl M aX2 

with cost 

Finally, if we place the root of a tree at a trifurcating vertex, we find in a similar manner that, if 
the root has w — 1 or 2 children that are leaves, the cost is 

/(L root ) = [k{(3-w)k + 2),k(3-w)(k-l)} . 

From these costs we see that the total cost of T for single site on a caterpillar tree (one cherry, 
(to — 4) half-internal vertices and a trifurcating root with 2 leaves) is 

/(singlesite) =[k+(m-4:)k(k + l) + k(k+2),0 + (m-A)k(k-l) + k(k-l)} 
= [(m-l)k+(m-3)k 2 , (m-3)Jfc(fc-l)] . 

This result is actually the same for all tree topologies. As any tree topology can be changed to 
any another by repeatedly removing single leaves and re-inserting them elsewhere on the tree, we 
show presently that this does not affect the total cost /(singlesite). In the case where a leaf is 
removed from a cherry vertex whose parent is an internal vertex, the difference in cost is 

/(L cherry ) + /(L int ) - f(L hallint ) = [k(k+l), fc(fc-l)] . 

If the leaf is removed from a triplet-from the cherry or otherwise, then the difference in cost is 

j^halfint) + ^cherry) _ ^cherry) = [fc(fc+l), fc(fc-l)] . 

So, no matter where the leaf is removed from, the difference in cost is the same. Now, by observing 
that the insertion of a leaf is exactly equivalent to the reverse of one these cases, we see that the 
cost of T is independent of tree topology. (See Figure [3] for an illustration of this.) 

Incidentally, if these simplifications are not taken into account, each PLV calculation takes 
the time of the PLV at a fully internal vertex. If this were the case, then (asymptotically in to) 
the calculation of the likelihood of each pattern would be slowed by a factor of (2fc+l)/(fe + l), 
which for fc = 4 is 1.8. We do not know if the standard phylogenetics maximum likel i hood software 
packages take account of these simplifications or not. It is clear that lPond fc Musel (|2004| ) did not 



take account of this in their analysis: they assigned a unit cost to each "tainted" vertex regardless 
of whether the vertex was internal, half internal or above a cherry. Clearly, taking these factors 
into account would affect the solution of their TSP problem and consequently the optimal column 
sorting for a given data set. 

The situation for our algorithm retroML is quite different as the cost is not independent of the 
tree topology; the extra information that is contained in the partial likelihood tensors allowing 
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Figure 4: (a) Evaluate leaf state whilst retaining vertex state; (b) Evaluate leaf state and sum 
over vertex states; (c) Move to new vertex whilst retaining state at current vertex; (d) Sum over 
states at vertex and move to adjacent vertex. 

the lookback step comes at additional computational cost. This additional cost depends on the 
rank of the tensor. 

If we start the algorithm at a leaf vertex, the first step costs nothing: 

V a (X l )=M aXl ; 

it is simply an assignment. After this beginning we assume that the current PLT is rank r and 
there arc four available generic moves: (a) and (b) apply when the current vertex is directly above 
a leaf, and (c) and (d) apply when the current vertex is completely internal. 

(a) Evaluate observed state at a leaf whilst retaining state at current vertex (see Figure 0J a)): 

^aia 2 ...a r *~ ^a 1 a 2 ...a r M ai x ■ 

This costs s(^ aia2 ... ar ) = [k r , 0] and the rank is unchanged: r <— r. 

(b) Evaluate observed state at a leaf and sum over state at current vertex (see Figure Sfb)): 



"1 
•„r j.r-1 1 



This costs s{^ a2 ...a T ) — [k r ,k r (k — 1)] and the rank is reduced by one: r <— r — 1. 

(c) Move to adjacent vertex whilst retaining state at current vertex (see FigureHfc)): 

^ aaia,2...a r * ™ aia,2...a r Maid' 

This costs s(^ aaia2 ... ar ) — [k r+1 ,0} and the rank is increased by one: r <— r+1. 

(d) Sum over states at vertex and move to adjacent vertex (see Figure Bid)): 

^ aa2-..a r * / ^ a\a 2 ...a r ^a\a- 

0,1 

This costs s(^ a a 2 ...a r ) = [k r+1 , k r (k — 1)] and the rank is unchanged: r <— r. 



We note that in each of these moves for retroML, as well as for the steps of J-, the number 
s + , f + of additions is strictly less than the number s*,f* of multiplications. Thus, for ease of 
presentation, from here on we will keep track of (and compare) counts of multiplications only. 

F inally, we note that the lexicographic ordering of patterns can be achieved using a radix 



sort (jCormen et all 120011 ). which is 0(m(N + k)). We assume that the time it takes to do this 
sort will not be significant to our analysis. We justify this by noting that, for each candidate 
tree, the lexicographic ordering of the data set need only be computed once, whereas to optimize 
model parameters the likelihood calculation must be iterated many times. In fact, the number 
of iterations that must be performed on a given tree scales with the number of edges on the 
tree. Thus, accounting for these iterations, the complexity of the standard implementation T is 
0(Nm 2 k 2 ). 

We can also counter point the cost of ordering the data set to the "column sorting" performed 
bv lPond fc Musd ( 2004T ). In that approach, an 0(N 2 ) approximate solution to a TSP was required 



to sort the data set. Given that in most applications m < N, we can argue strongly that if the 
column sorting approach can achieve significant speedups, regardless of the need to solve the TSP, 
we certainly expect that our approach will do so also. 

Breaking tack from the main thread of this article, which favours exact counts of empirical 
timings, we tested these assertions by timing likelihood and pattern sorting co mputations on a 
personal computer. For instance, while the likelihoo d calculation using PAU P* (SwoffordJ, [2003) 



on an arbitrary tree constructed from a plant data set (jGoremykin et aL , 2003) of 15 sequences and 



31k b ase pairs took approximately 0.33s, sorting the patterns in R (|R Development Core Team 



20061 ) took approximately 0.005s on the same machine. This was without taking advantage of the 



possibilities of radix sorting. 

3 Results 

3.1 Simulation study 

Using the counts presented in the previous section, we conducted a simulation study comparing the 
number of multiplications r equired by retroML to tha t of Felsenstein's method T . We generated 
DNA alignments with Filo ( Charleston fc Hold 12008 ) for each tree size in the range m — A to 15 



and each sequence length in the range 1 to 10 . The trees were randomly generated using the 
Yule model (birth only process with pendant edge lengths drawn from a uniform distribution). 
We used the HKY Markov model with flat parameter settings and imposed a molecular clock 
with root-to-leaf height of .32 substitutions per site. For each sequence alignment we found the 
multiplicative cost of computing likelihoods at the root on another randomly tree generated under 
the same conditions. The reason we generated a new tree is that we did not want there to be any 
biasing effect from the lookback structure caused by the true tree (in unpublished tests we showed 
that any such effect is not detectable anyway.) 

In Figure [5] we plot the ratio f*/s* against the number of unique patterns observed. In the 
quartet and quintet case, the speedup is always greater than or equal to 1. This is guaranteed as, 
in those cases, the tree is always a caterpillar and the rank of the PLTs never becomes greater 
than 1. In the sextet case, the two possible tree topologies are clearly noticeable (the caterpillar 
systematically giving a greater speedup than the balanced case). For larger trees, the effect of 
different topologies on the cost is not noticeable in the plots as it is overpowered by sampling error. 
(In unpublished tests we could detect this difference by fixing sequence length and running many 
more trials.) One may worry that retroML can sometimes take longer than J 7 , but it should be 
noted that this is only the case for short sequences, and in these cases the time taken to compute 
the likelihood is so small that this will not be of importance in practice. It is the performance 
of retroML for large sequence lengths, where the overall computing time is longer, that shows its 
power. 

The plots show clearly that retroML performs very well for small trees of up to 9 leaves. 
However, for larger trees, the performance of retroML begins to degrade significantly. It seems 



that the effectiveness of our method is confounded by the additional costs involved with high rank 
PLTs required by large trees. Using the heuristics presented in Appendix[Bl balanced trees present 
the worst case in this regard, with the maximum rank required bounded from above by log 2 (m/2). 
Although this is quite a good bound, we see from the plots that the effect of high rank PLTs is 
rather debilitating, and there seems little use in implementing retroML on trees with more than 
15 leaves (this message is bourne out by unpublished results). One would hope that for arbitrarily 
large trees there is some way of taking advantage of the favourable performance of retroML on 
small trees. In [|4]we will discuss two approaches to how this may be achieved. 

3.2 Caterpillar trees, exact results 

For a caterpillar tree, since all the internal vertices are next to a leaf, the rank of the PLTs never 
becomes greater than one. In this case the cost for a single pattern of our approach is exactly 
equal to that of standard approach. This means that our approach will always be faster when 
there is more than one site in the alignment. In the quartet and quintet cases there is only the 
caterpillar topology, hence our approach will always be superior on trees of this size. If every 
possible pattern is present in the alignment, we can calculate exactly the cost of retroML on a 
caterpillar tree, as follows. 

The multiplicative cost s* of retroML for each pattern with lookback 1 < a < (to — 2) is 
s* = [(m-er)fc+(m-2-cr)fc 2 ], for a — the cost is Sq= [(m-l)fc+(m-3)fc 2 ] , and for a = (m— 1) 
the cost is s* m _ 1 = k. If every possible pattern occurs in the alignment, the total cost for any to is 
then 



fm—1 



s*(alldatacaterpillar) = 
k- [{m-l)k+(m-3)k 2 } + ( ^ k a {k-l) ■ [(m-cr)k + (m-2-a)k 2 ] ) + k Tnr - 1 (k-l) ■ k, 

which, with a little finessing using geometric series, becomes 

m— 3 

s*(alldatacaterpillar) = fc 3 (fc + l) ^2 k l = k 3 (k + l) 



m ~ 3 k m ~ 2 — 1 



7=0 



fc-1 



Thus, in the case of a caterpillar tree and all possible patterns being present, the speedup that 
retroML provides is 

/*(alldatacaterpillar) _ (fc-l)fc m - 2 [(m-l) + (m-3)fc] 
s*(alldatacaterpillar) ~ (fc+l)(fc m " 2 -l) 

This is an 0(mk) speedup. For quartets (m = 4) this speedup works out to 4.48, exactly as 
indicated in Figure[5fa). For quintets (m — b) this speedup works out to 7.31, and for to = 50 the 
speedup would be 140. 

Of course, there is simply no way, even for modest values of to, that in practice a sequence 
alignment will contain any more than a (very) small number of the possible patterns. For instance, 
taking m = 10 and k = 4, the number of possible patterns is ~ 10 6 and for to = 15 the figure 
becomes ~ 10 9 . Even with maximally heterogeneous data, these numbers are well out of the reach 
of practical sequence alignments. 

4 Large trees 

As we saw in §3.1[ retroML is only effective on realistic alignments when the number of sequences 
is less than about 16. This is partly because the proportion of patterns present in an alignment 
of fixed length quickly becomes extremely small for large trees, but mostly because of the issues 
associated with the high rank PLTs required. Clearly retroML will be very effective for quartet 
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Figure 5: Speedups f*/s* for sequence length 1 to 10 4 
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puzzling or other supertree methods reliant on performing maximum likelihood only on small 
subtrees. We can only but recommend that in these cases our approach be adopted. 

Here we will discuss how the favourable performance of our algorithm on small trees can be 
exploited to achieve speedups for maximum likelihood computations on arbitrarily large trees. We 
have two different ideas as to how this could be achieved. 

Divide 

The idea behind this approach is to divide a "large" tree of size m into q "small" subtrees of size 
close to m/q. For the root of each of these subtrees, we can use retroML to compute the PLV for 
each subpattern in the data set corresponding to that subtree. Then, the likelihood of the whole 
tree can be computed in the usual way using Felsenstein's recursion T on the internal part of the 
tree. The effectiveness of this approach hinges on the fact that, as the tree gets larger, the size of 
the internal part of the tree, that is to be computed using J-, increases at the same rate as the 
number of subtrees on the outer part that are to be computed using retroML (both effects are 
linear in q). Thus the speedup obtained using our approach on the subtrees scales with the size 
of the whole tree and this will result in a tangible overall speedup for arbitrarily large trees. 

When implementing this approach, the PLVs at the root of each subtree and for each pattern 
must be retained in memory, and there will be additional bookkeeping involved in bringing together 
the relevant PLVs when it comes time to traverse the internal part of the tree. 

Considering possible problems on the memory side of things, we note that in the standard 
implementation, N likelihoods (at the root) and 2(m— l)fc 2 entries in the transition matrices must 
be retained in memory, which is an 0(N + mk 2 ) memory requirement. Employing the approach 
described above, using q subtrees, requires us to record (iVi+iV2 + . . -+N q ) PLVs, where iV; is the 
number of unique subpatterns associated with the i th subtree. In the (rather unlikely) worst case 
we have (N1+N2 + ■ ■ . + N q ) = qN, so the memory requirements on the internal part of the tree 
are 0(qNk). This could certainly be a large increase in memory requirements, but this is a very 
conservative estimate. Additionally, it is linear in the size of the tree m and hence will not get of 
hand for large trees. 

With regard to numerical computations, for the standard implementation let /^ be the mul- 
tiplicative cost of computing the PLV at the bifurcating root of a subtree of size rh. We use our 
previous result that this cost is independent of tree topology and note that a caterpillar tree with 
a bifurcating root has (m — 2) half-internal vertices and a single cherry, so that 

fl = (m-2)k(k+l) + k. 

If, for subtrees of size rh, retroML provides an average speedup of su„ : =/4/ s rfi' then the speedup 
(on average) for a tree of size m will be 

N(qfl + (q-l)k(2k+l) + k(3k+l)) 



m V(suT i V4 + ( (Z -l)fc(2fc + l) + fc(3fc+l))- 
Taking the limit q — > 00, we find that the speedup approaches 

rhX 



m (rhX — l)/sum + 1' 

where A:= (fc + l)/(2fc + l). For instance, if we set fc = 4, rh — 6 and assume a retroML speedup of 
su 6 = 1,2,3,4 and 5 (see Figure [5fc) for justification), the corresponding speedups achieved for 
extremely large trees would be 1, 1.83, 2.53, 3.13, 3.69 and 4.41, respectively. 

Restrict 

The idea behind this approach is to restrict the rank of the PLTs to some maximum value R. This 
can be achieved by "breaking" the algorithm at any point where the next move will result in a 
PLT of rank greater than R, as follows 
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Figure 6: Tree with 16 leaves 

Suppose the method breaks at vertex v. To continue, the PLT at v is retained and the algorithm 
jumps directly to the next nearest leaf £. Next, the PLT conditioned upon only the leaf state at £ is 
computed, and the standard procedure of computing PLTs is continued from here until vertex v is 
reached. At this point, the method combines (by multiplication on the relevant index) the current 
PLT with the PLT that was retained previously at v. From here the method can continue as 
normal until the maximum rank is exceeded once again or the calculation terminates. Computing 
likelihoods in this way can be made to give completely equivalent results to Felsenstein's approach. 

We illustrate this idea by considering the multiplicative cost of computing the likelihood for 
a completely balanced tree (Figure [6]) with m = 16 leaves. We do this using retroML with no 
restriction of the rank, which requires rank r = 3 PLTs, using retroMI^, which restricts to rank 
2 PLTs, and (hi) using retroMLi, which restricts to rank 1 PLTs. The formulas for the PLTs in 
each case, alongside multiplicative costs, are presented in Appendix |XJ The attraction of (iii) is 
that this will result in a method that will always be as fast or faster than Felsenstein's approach, 
no matter how little data there is. This is because the additional computations required by high 
rank PLTs are completely avoided, whilst still taking advantage of the lookback structure of the 
patterns. 

If every possible pattern occurs in the alignment, then, using the multiplicative costs presented 
in Appendix lAl and setting fc = 4, we find that the speedup for case (i) is 35.78, for case (ii) the 
speedup is 35.84, and for case (iii) the speedup is 34.63. This is a very positive result not least 
because restricting to rank r < 1 captures 97% of the speedup of the best case (ii). This is 
important for practical implementation because restricting to rank r < 1 will always be as least 
as fast as !F, regardless of the size of the data set. 

To illustrate the effectiveness of this method for realistic circumstance, we tested the cases (i), 
(ii) and (iii) on simulated data using the tree in Figure [6] and the model conditions described in 
§3.11 We found (Figure[7|) that for realistic sequence lengths (1 to 10 4 ), it is case (iii) that clearly 
performs best. This is because, for this range of sequence length, there is simply not enough data 
for the additional overhead of the unrestricted and rank 2 approaches to be worthwhile. 

5 Discussion 

In this article we have presented an alternative method for computing likelihoods in molecular 
phylogenetics. We have shown that with the simultaneous use of partial likelihood tensors and a 
lexicographic ordering of sites, we can achieve significant computational speedups. We did this 
by a careful examination of the number of numerical computations that arise in both our method 
and the standard approach based on Felsenstein's recursion. This type of analysis gives a rehned 
perspective on the efficiency of the algorithms used in phylogenetics. We showed that it is useful to 



13 










2000 4000 6000 

number of unique patterns present 



8000 



Figure 7: Speedups for balanced tree with m = 16 leaves (2,001 trials, sequence length evenly 
spaced from 1 to 10 4 ). Light grey is (unrestricted) retroML, dark grey is retroML2, and black is 
retroMLi. 



do more than simply state orders of complexity; for instance we uncovered that the computational 
cost of an application of Felsenstein's formula at a particular vertex depends on where that vertex 
lies on the tree. This observation has significant repercussions for anyone interested in writing 
efficient maximum likelihood software, and it is not clear which existing packages take account of 
this, as frequently the source code is unavailable. 

A clear path for further analysis of our approach would be to pinpoint exactly under what 
circumstance it is worthwhile to use high rank PLTs. Our final example on a 16 leaf tree was 
telling, as even for the case of all possible patterns present in the data set, the optimal maximum 
rank was 2, not 3 as occurs naturally while using retroML. One can envisage an algorithm that 
examines all the relevant structure of a sequence alignment and proceeds in the calculation of 
the likelihood using the absolute minimum number of computations. Considering the "restrict" 
modification of retroML we presented in [JH one could plausibly design a more refined method 
which, for each vertex on the tree, sets the maximum rank of the PLTs depending upon the 
density of data associated with the subtree subtended by that vertex. This may be difficult 
to implement and of theoretical interest only, but it raises the question of exactly what is the 
minimum number of calculations required to compute the likelihood of a part icular data set. A 
clue to the impracticality of such a scheme comes from lStamatakis et al\ ([20051 ) and their decision 
to only consider homogeneous subpatterns; sometimes the cost of extra pre-computation required 
by a clever method is simply not worth all the trouble. 

Although not examined in this article, PLTs also hold significant promise for improving per- 
formance under tree perturbations. This is because the partial likelihood tensors can encode a 
substantial amount of information that is invariant under tree perturbations. For instance, con- 
sider the rank 2 PLT represented in FigureEJb). Suppose that there was a larger subtree rooted at 
i>io, and we were interested in tree perturbations that altered this subtree and the leaves €5, £q and 
£7, but left the rest of the tree unchanged. Clearly this PLT is invariant to any such perturbations 
and could be retained throughout. In this way, the power of high rank PLTs could be applied 
directly to e fficient calculations during tree perturbation in close analogy to the "tree swapping" 
technique of lGuindon fc Gascuell (|2003f ) using PLVs. 
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*i"» ■" 23) (*i ■ • • x r) = * ( :r V23) (*i ■ ■ ■ X*)M$ 7 s* = e 

^,v 25 ,v 2e ) {Xi _ _ _ Xs) s *=k 3 + 3k 2 

= (£„ M%> (E, ^r 23) (*! • • ■ X 7 )M« 8 )) MifM^ 

y ( 2 4 ' V25 ' V26) (Xi... x 9 ) = ^: b T V25 - V2e \x 1 . . . X s )Mfl s* = fc 3 

^^\ Xl . . . X W ) = £ fe , (Erf * { 3tr'-~XXi ■ ■ ■ X 9 )A4 w x \ o ) MgP s* = 2fc 3 

*ir ,Var) (*i ■ • • Xn) = *i" 6 M " 27) (^i • • ■ *io)M <£ S * = fc 2 
*^\ Xl . . . X l2 ) = (£ a , M™ (E„ *i^" 27) (^i ■ ■ • XuWvxj) M^f s* = 3fc 2 
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*$™(*i . . . X 14 ) = £ a , M^ (£ 6 , W 29j (^i ■ ■ ■ X 13 )M^> i4 ) s* = 2k 

*i" 3o) (Xj . . . X 15 ) = *i W30) (X a . . . X 14 )M^ 5) 5 s* = k 
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Table 1: Calculating PLTs for balanced tree with no restriction on rank. 
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A Restricted retroML 

Here we present the computation of PLTs conditioned on subpatterns of size 1 to 16 for the 

phylogenctic tree Figure [6] using retroML and its "restricted" versions retroML2 and retroMLi. 
(i) With no restriction upon the rank of the PLTs, retroML proceeds as displayed in Table [TJ 
(ii) With rank restricted to r < 2, retroML2 proceeds by "breaking" in order to visit £5 and 

to visit £9 as displayed in Table O 

(iii) With rank restricted to r < 1, retroMLi proceeds by "breaking" in order to visit the leaves 

^3,^5,^77^9,^11 an d ^13 as displayed in Table[3] 

B Pseudo-code 

Here we present pseudo-code for implementing retroML generically. First we present heuristics 
for finding the best way to traverse the tree given that it is best to minimize the rank of PLTs 
that arise during implementation. 

Given a vertex v, we define /i mm (iO as the minimum distance to a leaf: 

ft min(^) :=mi n(|P(t>,^)|), 

where L is the leaf vertices and P(v, u) is the tuple of vertices that lie on the path from vertex v 
to u. We denote the subtree subtended by a vertex v as T„. 
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Tabic 2: Calculating PLTs for balanced tree while restricted to rank r < 2. 

The f avouriteChild of a vertex is chosen first based on minimum ^ mm ; next, if this is not 
unique, based on the smallest subtended subtree; finally, on minimizing the sum of all the /i m i n 
on all vertices in the subtended subtree. Given a starting leaf, getTraversalOrder returns the 
traversal of a tree that retroML takes by following the f avouriteChild heuristic. 

Another heuristic, bestStartingLeaf, attempts to find the optimal place for the retroML 
method to start. This is done by taking into consideration the comments of H2.2I We choose 
leaves such that the maximum /i mm of the subtrees hanging off the path P(£q,£-\) is minimized, 
and then we ensure that the maximum /i mm of these subtrees occurs early in the tree traversal. 

(v) 

retroML itself proceeds given a tree T and a transition matrix M^ b for every vertex excluding 
the root. For optimal performance, the patterns in the alignment should be sorted lexicographically 
with respect to the order of leaves in getTraversalOrder. 
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Table 3: Calculating PLTs for balanced tree while restricted to rank r < 1. 



favour it eChild(v) 

let the children of v be u±, u 2 

if /i nl j n (u i ) has a unique minimum at u a 

return u a 
else if |T U J has a unique minimum at Ub 

return Ub 

else if \^2 veT ^min( w )) nas a unique minimum at u c 

return u c 
else 

return arbitrarily chosen Ui 



badChild(v) 

let the children of v be u±, u 2 
if U\ = f avouriteChild(f) 

return u 2 
else 

return u\ 
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getTraversalOrder(^o) 

mark £q as "visited" and all other vertices as "unvisited" 

let v be the current vertex, so v <— £$ 

let £ be an ordering of vertices in T; initially £ <— (^o) 

[this will eventually be the traversal order, (£q, . . . , £j)] 
while there are any unvisited vertices 
if v has any children 

V <— favouriteChild(u) 
else 

set v to be the closest unvisited vertex to v 
mark v as visited and append it to £ 
return £ 



bestStartingLeaf 

let L be a set of leaves, each chosen arbitrarily from a cherry in T 
for each ii G L 

compute (£i, . . . ,ltf) = getTraversalDrder(i'i) 

let PW be the tuple (v\ . . . Ufc(i)) of vertices Vj that lie on the path from £, to £^ 

let S^ 1 ' be a tuple (si, . . . , SkU)) of min heights: Sj <— ft-min^i) 

let <^ = max(sj) 
let qo = min q^ 

remove all leaves £i from L such that q^ > qq 
if L = {£} return £ 
for each remaining l± in L 

consider each tuple S^> — (si, S2, ■ ■ ■ , Sfc(i)) as an integer: &j <— S1S2 . . . SkU) 
return £i for which bi is maximal, ties resolved arbitrarily. 



retroML 

let L be leaves of tree T 

compute £{) ^bestStartingLeaf (T) 

reroot T at ^0 

let ty a <— M^x ; this is the current working PLT 

'0 

let $7 be a list; this will be the list "lookback" PLTs 

let Qi <— * a 

let u <— f avouriteChild(^o); this is the current working vertex 

[now we get the method started by processing the first pattern patt-J 
process(w, patt x ) 
while there are patterns left to process 

get lookback a of current pattern patt^ 

let * *- O m _ CT 

, T , ,T,('Vl-V2 V r ) 

suppose * = wi l0 ' 2 ..: 0r 
let v <— vi 
process(w, pattj 
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process(w, patt) 
if v is not a leaf 

processGoodChild(v, patt) 
processBadChild(v, patt) 



processGoodChild(i>, patt) 
v «— f avouriteChild(v) 
if v is a leaf 
append \& to il 

*t::ar Vr) - tffc.X^M^; [see Figure Ha)] 
else 

*fc; ,tV) <- *fc:: a f r) MS; [see Figure^c)] 
process(i>, patt) 



processBadChild(v, patt) 
v *- badChild(ii) 
if v is a leaf 
append ^! to fi 

M/fc^ 5 - Ea*fe'" r) ^S„; Ne FigureHb)] 
else 

^iT^.V.a';^ - £ Q ^fe'^MiSl; [see Figure Hd)] 
process(v, patt) 



References 

BOUSSAU, B. & Gouy, M. (2006). Efficient likelihood computations with nonreversible models 
of evolution. Syst. Biol. 55, 756-768. 

Bryant, D., Galtier, N. & Poursat, M.-A. (2005). Likelihood calculation in molecular 
phylogcnctics. In: Mathematics of Evolution and Phylogenetics (GASCUEL, O., ed.). Oxford 
University Press, pp. 33-62. 

Charleston, M. A. & Holt, K. A. (2008). Filo: a program to generate molecular sequences 
under general conditions. University of Sydney, 
http : //it . usyd. edu. au/~mcharles/software/filo . 

Cormen, T. H., Leiserson, C. E., Rivest, R. L. & Stein, C. (2001). Introduction to 
Algorithms, Second Edition. McGraw-Hill Science/Engineering/Math. 

Felsenstein, J. (1981). Evolutionary trees from DNA sequences: a maximum likelihood ap- 
proach. J. Mol. Evol. 17, 368-376. 

Goremykin, V. V., Hirsch-Ernst, K. I., W ? olfl, S. & Hellwig, F. H. (2003). Analysis of 
the Amborella trichopoda chloroplast genome sequence suggests that Amborella is not a basal 
angiosperm. Mol. Biol. Evol. 20, 1499-1505. 

Guindon, S. & Gascuel, O. (2003). A simple, fast, and accurate algorithm to estimate large 
phylogcnics by maximum likelihood. Syst. Biol. 52, 696-704. 



19 



Landsberg, J. M. & Manivel, L. (2008). Generalizations of Strassens equations for secant 
varieties of Segre varieties. Communications in Algebra 36, 405-422. 

Pond, S. L. K. & Muse, S. V. (2004). Column sorting: Rapid calculation of the phylogenetic 
likelihood function. Syst. Biol. 53, 685-692. 

R Development Core Team (2006). R: A Language and Environment for Statistical Comput- 
ing. R Foundation for Statistical Computing, Vienna, Austria. 

Stamatakis, A. (2006). RAxML-VI-HPC: Maximum likelihood-based phylogenetic analyses with 
thousands of taxa and mixed models. Bioinformatics 22, 2688-2690. 

Stamatakis, A., Ludwig, T. & Meier, H. (2005). RAxML-III: A fast program for maximum 
likelihood-based inference of large phylogenetic trees. Bioinformatics 21, 456-463. 

Sumner, J. G. (2006). Entanglement, Invariants, and Phylogenctics. PhD thesis, University of 
Tasmania, http://eprints.utas.edu.au . 

SwOFFORD, D. L. (2003). PAUP* : Phylogenetic Analysis Using Parsimony (* and Other Meth- 
ods), Version 4-0bl0. Sinauer Associates, Sunderland, Massachusetts. 

Whelan, S. (2007). New approaches to phylogenetic tree search and their application to large 
numbers of protein alignments. Syst. Biol. 56, 727-740. 

Yang, Z. (1997). PAML: a program package for phylogenetic analysis by maximum likelihood. 
Bioinformatics 13, 555-556. 



20 



